Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment


Abstract

Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to human perception, which relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of the acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selection of visual objects using object detection, and beamforming of the audio signal towards the detected objects, attempting to learn the correspondence between the objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10% improvement with first-order Ambisonics intensity vector (FOA-IV) features in comparison with log-mel spectrogram features; the use of object-oriented crops also brings significant performance increases in the action recognition downstream task. A number of audio-only downstream tasks are devised for testing the effectiveness of the learnt audio feature representation, obtaining results comparable to state-of-the-art methods on acoustic scene classification from Ambisonic and binaural audio.
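The abstract compares FOA intensity-vector (FOA-IV) features against log-mel spectrograms. As a minimal sketch of what such features look like (the paper's exact channel ordering, normalization convention, and any mel-band aggregation are not given here, so ACN channel ordering and per-bin unit normalization are assumptions), the acoustic intensity vector can be computed per time-frequency bin as the real part of the cross-spectrum between the omnidirectional channel W and the three dipole channels X, Y, Z:

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """Naive framed STFT with a Hann window. Returns (frames, bins)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def foa_intensity_vector(foa, n_fft=1024, hop=512, eps=1e-8):
    """
    Intensity-vector features from first-order Ambisonics audio.
    foa: array of shape (4, samples); ACN channel order (W, Y, Z, X) is assumed.
    Returns (frames, bins, 3): one 3-D direction estimate per TF bin.
    """
    W, Y, Z, X = (stft(ch, n_fft, hop) for ch in foa)
    # Active intensity: Re{conj(W) * [X, Y, Z]}
    # (pressure times particle-velocity components)
    I = np.stack([np.real(np.conj(W) * c) for c in (X, Y, Z)], axis=-1)
    # Normalize each TF bin to a unit direction vector
    return I / (np.linalg.norm(I, axis=-1, keepdims=True) + eps)
```

Each normalized bin points towards the apparent sound direction, which is what makes the feature useful for a spatial-alignment pretext task; in practice these maps are often stacked with spectral features before being fed to the audio encoder.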


Similar references

Learning Bimodal Structure in Audio-Visual Data

A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in...


Comparing the Impact of Audio-Visual Input Enhancement on Collocation Learning in Traditional and Mobile Learning Contexts

This study investigated the impact of audio-visual input enhancement teaching techniques on improving English as a Foreign Language (EFL) learners' collocation learning, as well as their accuracy concerning collocation use in narrative writing. In addition, it compared the impact and efficiency of audio-visual input enhancement in two learning contexts, namely traditional and mo...


Learning words from natural audio-visual input

We present a model of early word learning which learns from natural audio and visual input. The model has been successfully implemented to learn words and their audio-visual grounding from camera and microphone input. Although simple in its current form, this model is a first step towards a more complete, fully-grounded model of language acquisition. Practical applications include adaptive human-...


Cortical Plasticity of Audio–Visual Object Representations

Several regions in human temporal and frontal cortex are known to integrate visual and auditory object features. The processing of audio-visual (AV) associations in these regions has been found to be modulated by object familiarity. The aim of the present study was to explore training-induced plasticity in human cortical AV integration. We used functional magnetic resonance imaging to analyze t...


Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data

Audio Word2Vec offers vector representations of fixed dimensionality for variable-length audio segments using a Sequence-to-sequence Autoencoder (SA). These vector representations are shown to describe the sequential phonetic structures of the audio segments to a good degree, with real-world applications such as query-by-example Spoken Term Detection (STD). This paper examines the capability of la...



Journal

Journal: IEEE Journal of Selected Topics in Signal Processing

Year: 2022

ISSN: 1941-0484, 1932-4553

DOI: https://doi.org/10.1109/jstsp.2022.3180592